A memory-efficient data structure representing exact-match overlap 
graphs with application for next generation DNA assembly * 



Abstract

The maximal exact-match overlap of two strings $x$ and $y$, denoted by $ov_{\max}(x, y)$, is the
longest string which is a suffix of $x$ and a prefix of $y$. The exact-match overlap graph of $n$ given
strings of length $\ell$ is an edge-weighted graph in which each vertex is associated with a string and
there is an edge $(x, y)$ of weight $\omega = \ell - |ov_{\max}(x, y)|$ if and only if $\omega \le \lambda$, where
$|ov_{\max}(x, y)|$ is the length of $ov_{\max}(x, y)$ and $\lambda$ is a given threshold. In this paper, we show
that exact-match overlap graphs can be represented by a compact data structure that can be stored
using at most $(2\lambda - 1)(2\lceil \log n \rceil + \lceil \log \lambda \rceil)n$ bits with a guarantee that the basic
operation of accessing an edge takes $O(\log \lambda)$ time. We also propose two algorithms for constructing
the data structure. The first algorithm runs in $O(\lambda \ell n \log n)$ worst-case time and requires
$O(\lambda)$ extra memory. The second one runs in $O(\lambda \ell n)$ time and requires $O(n)$ extra memory.

Exact-match overlap graphs have been broadly used in the context of DNA assembly and the
shortest superstring problem, where the number of strings $n$ ranges from a couple of thousand to
a couple of billion and the length $\ell$ of the strings ranges from 25 to 1000, depending on the DNA
sequencing technology. However, many DNA assemblers using overlap graphs face the major problem
of constructing and storing them. In particular, it is impossible for these DNA assemblers to handle
the huge amount of data produced by next-generation sequencing technologies, where the number of
strings $n$ is usually very large, ranging from a hundred million to a couple of billion. If the graph
is stored explicitly, it requires $\Omega(n^2)$ memory, which is impossible in practice when $n$ is greater
than a hundred million. In fact, to the best of our knowledge, there is no DNA assembler that can
handle such a large number of strings. Fortunately, with our compact data structure, the major
problem of constructing and storing overlap graphs is practically solved, since it only requires
linear time and linear memory. As a result, it opens the door to building a DNA assembler that can
handle large-scale datasets efficiently.



1 Introduction 



An exact-match overlap graph of $n$ given strings of length $\ell$ is an edge-weighted graph defined
informally as follows. Each vertex is associated with a string and there is an edge $(x, y)$ of weight
$\omega = \ell - |ov_{\max}(x, y)|$ if and only if $\omega \le \lambda$, where $\lambda$ is a given threshold and
$|ov_{\max}(x, y)|$ is the length of the maximal exact-match overlap of two strings $x$ and $y$. The formal
definition of the exact-match overlap graph is given in Section 2.

Storing exact-match overlap graphs efficiently in terms of memory becomes essential when
the number of strings is very large. In the literature, there are two common data structures for
storing a general graph $G = (V, E)$. The first uses a two-dimensional array of size $|V| \times |V|$.
We call it the array-based data structure. One of its advantages is that accessing a given
edge takes $O(1)$ time. However, it requires $O(|V|^2)$ memory. The second data structure stores the
set of edges $E$. We call it the edge-based data structure. It requires $\Omega(|V| + |E|)$ memory, and
accessing a given edge takes $O(\log \Delta)$ time, where $\Delta$ is the degree of the graph. Both of these
data structures require $\Omega(|E|)$ memory. If exact-match overlap graphs are stored using either of
them, we will need $\Omega(|E|)$ memory. Even this much memory may not be feasible when the number of
strings exceeds a hundred million. In this paper we focus on data structures for exact-match
overlap graphs that need much less memory than $|E|$.

1.1 Our contributions 

We show that there is a compact data structure representing the exact-match overlap graph that
needs much less memory than $|E|$ with a guarantee that the basic operation of accessing an edge
takes $O(\log \lambda)$ time, which is almost a constant in the context of DNA assembly. The data structure
can also be constructed efficiently in both time and memory. In particular, we show that

• The data structure takes no more than $(2\lambda - 1)(2\lceil \log n \rceil + \lceil \log \lambda \rceil)n$ bits.

• The data structure can be constructed in $O(\lambda \ell n)$ time.

As a result, any algorithm using overlap graphs can be simulated on our compact data structure
with no more than $(2\lambda - 1)(2\lceil \log n \rceil + \lceil \log \lambda \rceil)n$ bits for storing the overlap graph, at
the cost of an extra $O(\log \lambda)$ factor in time. Clearly, if $\lambda$ is a constant or much smaller than $n$,
our data structure is well suited to any application that does not have enough memory to store
the overlap graph in the traditional way.

Our claim may sound contradictory because in some exact-match overlap graphs the number of
edges can be $\Omega(n^2)$ and it seems like it should require at least $\Omega(n^2)$ time and memory to construct
them. Fortunately, because of some special properties of exact-match overlap graphs, we can
construct and store them efficiently. In Section 3 we describe these special properties in detail.

Briefly, the idea of storing the overlap graph compactly comes from the following simple observation.
If the strings are sorted in lexicographic order, then for any string $x$ the lexicographic ranks
of the strings that contain $x$ as a prefix form an integer interval $[a, b]$.
Therefore, the out-neighborhood of a vertex can be described by at most $\lambda$
intervals. Such intervals have the nice property that they are either disjoint or contain each other.
This property allows us to describe the out-neighborhood of a vertex by at most $2\lambda - 1$ disjoint
intervals. Each interval costs $2\lceil \log n \rceil + \lceil \log \lambda \rceil$ bits, where $2\lceil \log n \rceil$ bits are for
storing its two bounds and $\lceil \log \lambda \rceil$ bits are for storing the weight. We have $n$ vertices, so the
amount of memory required by our data structure is no more than
$(2\lambda - 1)(2\lceil \log n \rceil + \lceil \log \lambda \rceil)n$ bits. Note that this is just an upper bound. In practice, the
amount of memory may be much less than that.

1.2 Application: DNA assembly 

The main motivation for exact-match overlap graphs comes from their use in implementing
fast approximation algorithms for the shortest superstring problem, which is the very first problem
formulation for DNA assembly. Exact-match overlap graphs can be used for other problem
formulations for DNA assembly as well.

Exact-match overlap graphs have been broadly used in the context of DNA assembly and the
shortest superstring problem, where the number of strings $n$ ranges from a couple of thousand to
a couple of billion and the length $\ell$ of the strings ranges from 25 to 1000, depending on the DNA
sequencing technology. However, many DNA assemblers using overlap graphs face the major problem of
constructing and storing them. In particular, it is impossible for these DNA assemblers to handle the
huge amount of data produced by next-generation sequencing technologies, where the number
of strings $n$ is usually very large, ranging from a hundred million to a couple of billion. If the
graph is stored explicitly, it requires $\Omega(n^2)$ memory, which is impossible in practice when
$n$ is greater than a hundred million. In fact, to the best of our knowledge, there is no DNA assembler
that can handle such a large number of strings. Fortunately, with our compact data structure, the major
problem of constructing and storing overlap graphs is practically solved, since it only requires linear
time and linear memory. As a result, it opens the door to building a DNA assembler
that can handle large-scale datasets efficiently.

1.3 Related work 

Gusfield et al. [GLS92, Gus97] consider the all-pairs suffix-prefix problem, which is a
special case of computing the exact-match overlap graph when $\lambda = \ell$. They devised an $O(\ell n + n^2)$-time
algorithm for solving the all-pairs suffix-prefix problem. In this case, the exact-match overlap
graph is a complete graph, so the run time of the algorithm is optimal if the exact-match overlap
graph is stored in the common way.

Although the run time of the algorithm by Gusfield et al. is theoretically optimal in that setting,
it uses the generalized suffix tree, which has two disadvantages in practice. The first disadvantage
is that the space consumption of the suffix tree is quite large [Kur99]. The second disadvantage is
that the suffix tree usually suffers from poor locality of memory references [OG10]. Fortunately,
Abouelhoda et al. [AKO04] proposed a suffix tree simulation framework that allows any algorithm
using the suffix tree to be simulated by enhanced suffix arrays. Ohlebusch and Gog [OG10] made
use of properties of enhanced suffix arrays to devise an algorithm for solving the all-pairs
suffix-prefix problem directly, without using the suffix tree simulation framework. The run time
of the algorithm by Ohlebusch and Gog is also $O(\ell n + n^2)$. Note that our data structure
and algorithm can be used to solve the suffix-prefix problem in $O(\lambda \ell n)$ time. In the context of
DNA assembly, $\lambda$ is typically much smaller than $n$, and hence our algorithm will be faster than the
algorithms of [Gus97] and [OG10].

In the literature, exact-match overlap graphs should be distinguished from approximate-match
overlap graphs, which are considered in [Mye05], [MGMB07] and [Pop09]. In the approximate-match
overlap graph, there is an edge between two strings $x$ and $y$ if and only if there is a prefix of $x$, say
$x'$, and there is a suffix of $y$, say $y'$, such that the edit distance between $x'$ and $y'$ is no more than
a certain threshold.

2 Preliminaries 

Let $\Sigma$ be the alphabet. The size of $\Sigma$ is a constant. In the context of DNA assembly,
$\Sigma = \{A, C, G, T\}$. The length of a string $x$ over $\Sigma$, denoted by $|x|$, is the number of symbols
in $x$. Let $x[i]$ be the $i$-th symbol of string $x$, and $x[i, j]$ be the substring of $x$ between the $i$-th
and the $j$-th positions. A prefix of string $x$ is the substring $x[1, i]$ for some $i$. A suffix of string
$x$ is the substring $x[i, |x|]$ for some $i$.

Given two strings $x$ and $y$ over $\Sigma$, an exact-match overlap between $x$ and $y$, denoted by $ov(x, y)$,
is a string which is a suffix of $x$ and a prefix of $y$ (notice that this definition is not symmetric). The
maximal exact-match overlap between $x$ and $y$, denoted by $ov_{\max}(x, y)$, is the longest exact-match
overlap between $x$ and $y$.

Exact-match overlap graphs: Given $n$ strings $s_1, s_2, \ldots, s_n$ and a threshold $\lambda$, the
exact-match overlap graph is an edge-weighted directed graph $G = (V, E)$ in which there is a vertex
$v_i \in V$ associated with the string $s_i$, for $1 \le i \le n$. There is an edge $(v_i, v_j) \in E$ if and only if
$|s_i| - |ov_{\max}(s_i, s_j)| \le \lambda$. The weight of the edge $(v_i, v_j)$, denoted by $\omega(v_i, v_j)$, is
$|s_i| - |ov_{\max}(s_i, s_j)|$.

Figure 1: An example of an overlap edge, marking $\omega(s_i, s_j)$ and $|ov_{\max}(s_i, s_j)|$ for two
strings $s_i$ and $s_j$ of length $\ell$.

The set of out-neighbors of a vertex $v$ is denoted by $OutNeigh(v)$. The size of the set of
out-neighbors of $v$, $|OutNeigh(v)|$, is called the out-degree of $v$. We denote the out-degree of $v$ as
$deg_{out}(v) = |OutNeigh(v)|$.

For simplicity, we assume that all the strings $s_1, s_2, \ldots, s_n$ have the same length $\ell$. Otherwise,
let $\ell$ be the length of the longest string and all else works.

The operation of accessing an edge given its two endpoints: Given any two vertices
$v_i$ and $v_j$, the operation of accessing the edge $(v_i, v_j)$ is the task of returning $\omega(v_i, v_j)$ if
$(v_i, v_j)$ is actually an edge of the graph, and returning NULL if $(v_i, v_j)$ is not.
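As a point of reference, the definitions above can be sketched directly in code. The following is only a naive baseline, not the data structure proposed in this paper: it compares the two strings on every access, so one access costs up to $O(\ell^2)$ symbol comparisons. The function names `ov_max` and `access_edge_naive` are our own, and `None` plays the role of NULL.

```python
from typing import Optional


def ov_max(x: str, y: str) -> str:
    """Longest string that is both a suffix of x and a prefix of y."""
    # Try candidate lengths from longest to shortest; the first match is maximal.
    for k in range(min(len(x), len(y)), 0, -1):
        if x.endswith(y[:k]):
            return y[:k]
    return ""  # the empty string always qualifies as an overlap


def access_edge_naive(s_i: str, s_j: str, lam: int) -> Optional[int]:
    """Weight |s_i| - |ov_max(s_i, s_j)| if it is at most lam, otherwise None."""
    weight = len(s_i) - len(ov_max(s_i, s_j))
    return weight if weight <= lam else None


print(ov_max("ACCGT", "CGTTA"))                # CGT
print(access_edge_naive("ACCGT", "CGTTA", 2))  # 2
print(access_edge_naive("ACCGT", "CGTTA", 1))  # None
```

Note that the overlap is not symmetric: `ov_max("CGTTA", "ACCGT")` is `"A"`.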

3 A memory-efficient data structure representing an exact-match 
overlap graph 

In this section, we describe a memory-efficient data structure to store an exact-match overlap graph.
It requires at most $(2\lambda - 1)(2\lceil \log n \rceil + \lceil \log \lambda \rceil)n$ bits. It guarantees that the time for accessing
an edge, given the two endpoints of the edge, is $O(\log \lambda)$. This may sound like a contradictory claim
because in some exact-match overlap graphs the number of edges can be $\Omega(n^2)$ and it seems like
it should require at least $\Omega(n^2)$ time and space to construct them. Fortunately, because of some
special properties of exact-match overlap graphs, we can construct and store them efficiently.
In the following paragraphs, we describe these special properties.

Without loss of generality, we assume that the $n$ input strings $s_1, s_2, \ldots, s_n$ are sorted in
lexicographic order. We can assume this because if they are not sorted, we can sort them by using
the radix sort algorithm, which runs in $O(\ell n)$ time. Radix sort takes $O(\ell n)$ time
in this case because we consider a constant alphabet size. Otherwise, it would take additional
$O(|\Sigma| \log |\Sigma|)$ time to sort the alphabet.

Each string $s_i$ and its corresponding vertex $v_i$ in the exact-match overlap graph are determined
by the string's lexicographic order $i$. We refer to the lexicographic order of any string as its
identification number. We will access an input string and its vertex through its identification number.
Therefore, the identification number and the vertex of an input string are used interchangeably.
Also, it is not hard to see that we need $\lceil \log n \rceil$ bits to store an identification number. We have the
following properties.

Given an arbitrary string $x$, let $PREFIX(x)$ be the set of identification numbers such that $x$
is a prefix of their corresponding input strings. Formally, $PREFIX(x) = \{i \mid x \text{ is a prefix of } s_i\}$.

Property 1. If $PREFIX(x) \ne \emptyset$, then $PREFIX(x) = [a, b]$, where $[a, b]$ is some integer interval
containing the integers $a, a + 1, \ldots, b - 1, b$.

Proof. Let $a = \min_{i \in PREFIX(x)} i$ and $b = \max_{i \in PREFIX(x)} i$. Clearly, $PREFIX(x) \subseteq [a, b]$. On
the other hand, we will show that $[a, b] \subseteq PREFIX(x)$. Let $i$ be any identification number
in the interval $[a, b]$. Since the input strings are in lexicographically sorted order,
$s_a[1, |x|] \le s_i[1, |x|] \le s_b[1, |x|]$. Since $a \in PREFIX(x)$ and $b \in PREFIX(x)$,
$s_a[1, |x|] = s_b[1, |x|] = x$. Thus, $s_a[1, |x|] = s_i[1, |x|] = s_b[1, |x|]$. Therefore, $x$ is a prefix
of $s_i$. Hence, $i \in PREFIX(x)$. □

For example, let

    $s_1$ = AAACCGGGGTTT
    $s_2$ = ACCCGAATTTGT
    $s_3$ = ACCCTGTGGTAT
    $s_4$ = ACCGGCTTTCCA
    $s_5$ = ACTAAGGAATTT
    $s_6$ = TGGCCGAAGAAG

If $x$ = AC, then $PREFIX(x) = [2, 5]$. Similarly, if $x$ = ACCC, then $PREFIX(x) = [2, 3]$.
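The example can be reproduced with a short sketch that relies on Property 1: because the strings are sorted, both ends of $PREFIX(x)$ can be located by binary search over whole strings. This uses Python's standard `bisect` module rather than the routines of Section 4, and the sentinel character `"\uffff"` is an assumption of ours (it must sort after every alphabet symbol). The name `prefix_interval` is hypothetical.

```python
from bisect import bisect_left

SENTINEL = "\uffff"  # assumption: sorts after every symbol of the alphabet


def prefix_interval(sorted_strings, x):
    """Return PREFIX(x) as a 1-based inclusive pair (a, b), or None if it is empty."""
    lo = bisect_left(sorted_strings, x)             # first string >= x
    hi = bisect_left(sorted_strings, x + SENTINEL)  # first string beyond all x-prefixed ones
    return (lo + 1, hi) if lo < hi else None


strings = [
    "AAACCGGGGTTT",
    "ACCCGAATTTGT",
    "ACCCTGTGGTAT",
    "ACCGGCTTTCCA",
    "ACTAAGGAATTT",
    "TGGCCGAAGAAG",
]
print(prefix_interval(strings, "AC"))    # (2, 5)
print(prefix_interval(strings, "ACCC"))  # (2, 3)
print(prefix_interval(strings, "G"))     # None
```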

Property 1 tells us that $PREFIX(x)$ can be expressed by an interval, which is determined by
its lower bound and its upper bound. So we only need $2\lceil \log n \rceil$ bits to store $PREFIX(x)$. In
the rest of this paper, we will refer to $PREFIX(x)$ as an interval. Also, given an identification
number $i$, checking whether $i$ is in $PREFIX(x)$ can be done in $O(1)$ time. In Subsection 4.1
we will discuss two algorithms for computing $PREFIX(x)$ for a given string $x$. The run times of
these algorithms are $O(|x| \log n)$ and $O(|x|)$, respectively.

Property 1 leads to the following property.

Property 2. $OutNeigh(v_i) = \bigcup_{1 \le \omega \le \lambda} PREFIX(s_i[\omega + 1, |s_i|])$ for each vertex $v_i$. In other
words, $OutNeigh(v_i)$ is the union of at most $\lambda$ non-empty intervals.

Proof. Let $v_j$ be a vertex in $OutNeigh(v_i)$. By the definition of the exact-match overlap graph,
$1 \le |s_i| - |ov_{\max}(s_i, s_j)| = \omega(v_i, v_j) \le \lambda$. Let $\omega(v_i, v_j) = \omega$. Therefore,
$ov_{\max}(s_i, s_j) = s_i[\omega + 1, |s_i|] = s_j[1, |ov_{\max}(s_i, s_j)|]$. This implies
$v_j \in PREFIX(s_i[\omega + 1, |s_i|])$.

On the other hand, let $v_j$ be any vertex in $PREFIX(s_i[\omega + 1, |s_i|])$ for some $1 \le \omega \le \lambda$. It is
easy to check that $v_j \in OutNeigh(v_i)$. Hence,
$OutNeigh(v_i) = \bigcup_{1 \le \omega \le \lambda} PREFIX(s_i[\omega + 1, |s_i|])$. □

From Property 2 it follows that we can represent $OutNeigh(v_i)$ by at most $\lambda$ non-empty
intervals, which need at most $2\lambda \lceil \log n \rceil$ bits to store. Therefore, it takes at most $2n\lambda \lceil \log n \rceil$
bits to store the exact-match overlap graph. However, given two vertices $v_i$ and $v_j$, it takes $O(\lambda)$
time to retrieve $\omega(v_i, v_j)$ because we have to sequentially check whether $v_j$ is in
$PREFIX(s_i[2, |s_i|])$, $PREFIX(s_i[3, |s_i|])$, $\ldots$, $PREFIX(s_i[\lambda + 1, |s_i|])$. But if $OutNeigh(v_i)$
can be represented by $k$ disjoint intervals, then the task of retrieving $\omega(v_i, v_j)$ can be done in
$O(\log k)$ time by using binary search. In Lemma 1 we show that $OutNeigh(v_i)$ is a union of at most
$2\lambda - 1$ disjoint intervals.

Property 3. For any two strings $x$ and $y$ with $|x| \le |y|$, one of the following two
statements holds:

• $PREFIX(y) \subseteq PREFIX(x)$

• $PREFIX(y) \cap PREFIX(x) = \emptyset$

Proof. There are only two possible cases for $x$ and $y$.

Case 1: $x$ is a prefix of $y$. In this case, it is not hard to infer that $PREFIX(y) \subseteq PREFIX(x)$.

Case 2: $x$ is not a prefix of $y$. In this case, it is not hard to infer that
$PREFIX(y) \cap PREFIX(x) = \emptyset$. □

Lemma 1. Given $\lambda$ intervals $[a_1, b_1], [a_2, b_2], \ldots, [a_\lambda, b_\lambda]$ satisfying Property 3, their union is
the union of at most $2\lambda - 1$ disjoint intervals. Formally, there exist $p \le 2\lambda - 1$ disjoint intervals
$[a'_1, b'_1], [a'_2, b'_2], \ldots, [a'_p, b'_p]$ such that $\bigcup_{1 \le i \le \lambda} [a_i, b_i] = \bigcup_{1 \le i \le p} [a'_i, b'_i]$.

Proof. We say interval $[a_i, b_i]$ is a parent of interval $[a_j, b_j]$ if $[a_i, b_i]$ is the smallest interval
containing $[a_j, b_j]$. We also say interval $[a_j, b_j]$ is a child of interval $[a_i, b_i]$. Since the intervals
$[a_i, b_i]$ are either pairwise disjoint or contain each other, each interval has at most one parent.
Therefore, the set of intervals $[a_i, b_i]$ forms a forest in which each vertex is associated with an
interval; see Figure 2. For each interval $[a_i, b_i]$, let $I_i$ be the set of the maximal intervals that are
contained in $[a_i, b_i]$ but disjoint from all of its children. For example, if $[a_i, b_i] = [1, 20]$ and its
child intervals are $[3, 5]$, $[7, 8]$ and $[12, 15]$, then $I_i = \{[1, 2], [6, 6], [9, 11], [16, 20]\}$. If the interval
$[a_i, b_i]$ is a leaf interval (an interval that does not have any children), $I_i$ is simply the set containing
only the interval $[a_i, b_i]$. Let $\mathcal{A} = \bigcup_{1 \le i \le \lambda} I_i$. We will show that $\mathcal{A}$ is the set of disjoint
intervals $[a'_i, b'_i]$ satisfying the condition of the lemma.

Firstly, we show that $\bigcup_{1 \le i \le \lambda} [a_i, b_i] = \bigcup_{[a', b'] \in \mathcal{A}} [a', b']$. By the construction of $I_i$, it is
trivial to see that $\bigcup_{[a', b'] \in \mathcal{A}} [a', b'] \subseteq \bigcup_{1 \le i \le \lambda} [a_i, b_i]$. Conversely, it is enough to show
that $[a_i, b_i] \subseteq \bigcup_{[a', b'] \in \mathcal{A}} [a', b']$ for any $1 \le i \le \lambda$. This can be proved by induction on the
vertices in each tree of the forest. For the base case, each leaf interval $[a_i, b_i]$ is in $\mathcal{A}$. Therefore,
$[a_i, b_i] \subseteq \bigcup_{[a', b'] \in \mathcal{A}} [a', b']$ for any leaf interval $[a_i, b_i]$. For any internal interval $[a_i, b_i]$,
assume that all of its child intervals are subsets of $\bigcup_{[a', b'] \in \mathcal{A}} [a', b']$. By the construction of $I_i$,
$[a_i, b_i]$ is the union of all of the intervals in $I_i$ and all of its child intervals. Therefore,
$[a_i, b_i] \subseteq \bigcup_{[a', b'] \in \mathcal{A}} [a', b']$.

Secondly, we show that the intervals in $\mathcal{A}$ are pairwise disjoint. It is sufficient to show that
any interval in $I_i$ is disjoint from every interval in $I_j$ for $i \ne j$. Obviously, the statement is true
if $[a_i, b_i] \cap [a_j, b_j] = \emptyset$. Let us consider the case where one contains the other. Without loss of
generality, we assume that $[a_j, b_j] \subset [a_i, b_i]$. Consider two cases:

Case 1: $[a_i, b_i]$ is the parent of $[a_j, b_j]$. By the construction of $I_i$, any interval in $I_i$ is disjoint from
$[a_j, b_j]$. By the construction of $I_j$, any interval in $I_j$ is contained in $[a_j, b_j]$. Therefore, they are
disjoint.

Case 2: $[a_i, b_i]$ is not the parent of $[a_j, b_j]$. Let
$[a_j, b_j] = [a_{i_0}, b_{i_0}] \subset [a_{i_1}, b_{i_1}] \subset \cdots \subset [a_{i_h}, b_{i_h}] = [a_i, b_i]$,
where $[a_{i_t}, b_{i_t}]$ is the parent of $[a_{i_{t-1}}, b_{i_{t-1}}]$. From the result in Case 1, any interval in
$I_{i_t}$ is disjoint from $[a_{i_{t-1}}, b_{i_{t-1}}]$ for $1 \le t \le h$. So any interval in $I_i$ is disjoint from
$[a_j, b_j]$. We already know that any interval in $I_j$ is contained in $[a_j, b_j]$. Thus, they are disjoint.

Finally, we show that the number of intervals in $\mathcal{A}$ is no more than $2\lambda - 1$. We have
$|\mathcal{A}| = \sum_{i=1}^{\lambda} |I_i|$. It is easy to see that the number of intervals in $I_i$ is no more than the
number of children of $[a_i, b_i]$ plus one, which is equal to the degree of the vertex associated with
$[a_i, b_i]$ if the vertex is not a root of a tree in the forest, and equal to the degree of the vertex plus
one if the vertex is a root. Let $q$ be the number of trees in the forest. Then,
$|\mathcal{A}| = \sum_{i=1}^{\lambda} |I_i| \le \sum_{i=1}^{\lambda} d_i + q = 2|E| + q$,
where $d_i$ is the degree of the vertex associated with $[a_i, b_i]$ and $E$ is the set of the edges of the
forest. We know that in a tree the number of edges is equal to the number of vertices minus one.
Thus, $|E| = \lambda - q$. Therefore, $|\mathcal{A}| \le 2(\lambda - q) + q = 2\lambda - q \le 2\lambda - 1$. This completes our proof. □
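The sets $I_i$ used in the proof are easy to compute once the children of an interval are known. The sketch below is illustrative only; the helper name `uncovered_parts` is ours, and it assumes the children are pairwise disjoint and contained in the parent. It returns the maximal sub-intervals of a parent that no child covers, and reproduces the $[1, 20]$ example from the proof.

```python
def uncovered_parts(parent, children):
    """Maximal sub-intervals of `parent` not covered by its (disjoint) child intervals.

    Intervals are closed integer intervals given as (lower, upper) pairs.
    """
    a, b = parent
    parts = []
    cursor = a  # first position not yet known to be covered or emitted
    for child_a, child_b in sorted(children):
        if cursor < child_a:
            parts.append((cursor, child_a - 1))  # gap in front of this child
        cursor = child_b + 1
    if cursor <= b:
        parts.append((cursor, b))  # tail after the last child
    return parts


print(uncovered_parts((1, 20), [(3, 5), (7, 8), (12, 15)]))
# [(1, 2), (6, 6), (9, 11), (16, 20)]
print(uncovered_parts((3, 5), []))  # a leaf keeps itself: [(3, 5)]
```

The result never has more entries than the number of children plus one, which is the counting fact used in the last step of the proof.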

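Figure 2: A forest illustration in the proof of Lemma 1. The top level of the forest holds the
intervals [1, 93] and [100, 130]; the second level holds [1, 20], [25, 50] and [110, 120]; the third
level holds [3, 5], [7, 8], [12, 15], [28, 36], [40, 47], [112, 114] and [116, 120]; the fourth level
holds [30, 33].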

From the proof, an algorithm for computing the disjoint intervals is straightforward: first
construct the interval forest. Once the forest is built, outputting the disjoint intervals can be done
easily at each vertex. However, designing a fast algorithm for constructing the forest is not
trivial. In Subsection 4.2 we will discuss an $O(\lambda \log \lambda)$-time algorithm for constructing the forest.
Thereby, there is an $O(\lambda \log \lambda)$-time algorithm for computing the disjoint intervals $[a'_i, b'_i]$ in
Lemma 1, given $\lambda$ intervals satisfying Property 3. Also, from Property 3 and Lemma 1, it is not hard to
prove the following theorem.

Theorem 1. $OutNeigh(v_i)$ is the union of at most $2\lambda - 1$ disjoint intervals. Formally,
$OutNeigh(v_i) = \bigcup_{1 \le m \le p} [a_m, b_m]$ where $p \le 2\lambda - 1$ and $[a_m, b_m] \cap [a_{m'}, b_{m'}] = \emptyset$ for
$1 \le m \ne m' \le p$. Furthermore, $\omega(v_i, v_j) = \omega(v_i, v_k)$ for any $1 \le m \le p$ and for any
$v_j, v_k \in [a_m, b_m]$.

Theorem 1 suggests a way of storing $OutNeigh(v_i)$ by at most $2\lambda - 1$ disjoint intervals. Each
interval takes $2\lceil \log n \rceil$ bits to store its lower bound and its upper bound, and $\lceil \log \lambda \rceil$ bits to store
the weight. Thus, we need $2\lceil \log n \rceil + \lceil \log \lambda \rceil$ bits to store each interval. Therefore, it takes at most
$(2\lambda - 1)(2\lceil \log n \rceil + \lceil \log \lambda \rceil)$ bits to store each $OutNeigh(v_i)$. Overall, we need
$(2\lambda - 1)(2\lceil \log n \rceil + \lceil \log \lambda \rceil)n$ bits to store the exact-match overlap graph. The disjoint intervals
of each $OutNeigh(v_i)$ are stored in sorted order of their lower bounds. Therefore, the operation of
accessing an edge $(v_i, v_j)$ can be done in $O(\log \lambda)$ time by using binary search.
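The storage bound and the edge-access operation can be illustrated as follows. This is a sketch under our own assumptions: the class name `OutNeighbors` and the function `memory_bound_bits` are hypothetical, intervals are kept as ordinary Python tuples rather than bit-packed fields, and logarithms are taken base 2.

```python
from bisect import bisect_right


def memory_bound_bits(n: int, lam: int) -> int:
    """Upper bound (2*lam - 1) * (2*ceil(log n) + ceil(log lam)) * n, in bits."""
    ceil_log_n = (n - 1).bit_length()      # equals ceil(log2 n) for n >= 1
    ceil_log_lam = (lam - 1).bit_length()  # equals ceil(log2 lam) for lam >= 1
    return (2 * lam - 1) * (2 * ceil_log_n + ceil_log_lam) * n


class OutNeighbors:
    """Disjoint (lower, upper, weight) intervals of one vertex, sorted by lower bound."""

    def __init__(self, pieces):
        self.pieces = sorted(pieces)
        self.lowers = [a for a, _, _ in self.pieces]

    def weight(self, j):
        """Weight of the edge to vertex j, or None if there is no such edge."""
        k = bisect_right(self.lowers, j) - 1  # last interval starting at or before j
        if k >= 0 and j <= self.pieces[k][1]:
            return self.pieces[k][2]
        return None


out = OutNeighbors([(2, 3, 1), (4, 4, 2), (9, 12, 2)])
print(out.weight(3), out.weight(4), out.weight(7), out.weight(10))  # 1 2 None 2
print(memory_bound_bits(1024, 4))  # 157696
```

Each lookup inspects $O(\log p)$ intervals, where $p \le 2\lambda - 1$ is the number of stored intervals of the vertex.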

4 Algorithms for constructing the compact data structure 

In this section, we describe two algorithms for constructing the data structure representing the
exact-match overlap graph. The run time of the first algorithm is $O(\lambda \ell n \log n)$ and it only uses
$O(\lambda)$ extra memory, besides the $\ell n \lceil \log |\Sigma| \rceil$ bits of memory used to store the $n$ input strings. The
second algorithm runs in $O(\lambda \ell n)$ time and requires $O(n)$ extra memory. As shown in Section 3, the
algorithms need two routines. The first routine computes $PREFIX(x)$ and the second one computes
the disjoint intervals described in Lemma 1.

4.1 Computing interval PREFIX(x) 

In this subsection, we consider the problem of computing the interval $PREFIX(x)$, given a string
$x$ and $n$ input strings $s_1, s_2, \ldots, s_n$ of the same length $\ell$ in lexicographical order. We describe two
algorithms for this problem. The first algorithm takes $O(|x| \log n)$ time and $O(1)$ extra memory.
The second algorithm runs in $O(|x|)$ time and requires $O(n)$ extra memory.

4.1.1 A binary search based algorithm 

Let $[a_i, b_i] = PREFIX(x[1, i])$ for $1 \le i \le |x|$. It is easy to see that
$PREFIX(x) = [a_{|x|}, b_{|x|}] \subseteq [a_{|x|-1}, b_{|x|-1}] \subseteq \cdots \subseteq [a_1, b_1]$. Consider the following
input strings, for example.

    $s_1$ = AAACCGGGGTTT
    $s_2$ = ACCAGAATTTGT
    $s_3$ = ACCATGTGGTAT
    $s_4$ = ACGGGCTTTCCA
    $s_5$ = ACTAAGGAATTT
    $s_6$ = TGGCCGAAGAAG
    $x$ = ACCA

Then, $[a_1, b_1] = [1, 5]$, $[a_2, b_2] = [2, 5]$, $[a_3, b_3] = [2, 3]$ and $PREFIX(x) = [a_4, b_4] = [2, 3]$.

We will find $[a_i, b_i]$ from $[a_{i-1}, b_{i-1}]$ for $i$ from 1 to $|x|$, where $[a_0, b_0] = [1, n]$ initially. Thereby,
$PREFIX(x)$ is computed. Let $Col_i$ be the string that consists of all the symbols at position
$i$ of the input strings. In the above example, $Col_3$ = ACCGTG. Observe that the symbols in the
string $Col_i[a_{i-1}, b_{i-1}]$ are in lexicographical order for $1 \le i \le |x|$. Thus, the occurrences of any
symbol in the string $Col_i[a_{i-1}, b_{i-1}]$ are consecutive. Another observation is that $[a_i, b_i]$ is the
interval where the symbol $x[i]$ appears consecutively in the string $Col_i[a_{i-1}, b_{i-1}]$. Therefore,
$[a_i, b_i]$ is determined by searching for the symbol $x[i]$ in the string $Col_i[a_{i-1}, b_{i-1}]$. This can be
done by first using binary search to find a position in the string $Col_i[a_{i-1}, b_{i-1}]$ where the symbol
$x[i]$ appears. If the symbol $x[i]$ is not found, we return the empty interval and stop. If the symbol
$x[i]$ is found at position $c_i$, then $a_i$ (respectively $b_i$) can be determined by using the double search
routine in the string $Col_i[a_{i-1}, c_i]$ (resp. the string $Col_i[c_i, b_{i-1}]$) as follows. We consider the
symbols in the string $Col_i[a_{i-1}, c_i]$ at positions $c_i - 2^0, c_i - 2^1, \ldots, c_i - 2^k, a_{i-1}$, where
$k = \lfloor \log(c_i - a_{i-1}) \rfloor$. We find $j$ such that the symbol $Col_i[c_i - 2^j]$ is the symbol $x[i]$ but the
symbol $Col_i[c_i - 2^{j+1}]$ is not. Finally, $a_i$ is determined by using binary search in the string
$Col_i[c_i - 2^{j+1}, c_i - 2^j]$. Similarly, $b_i$ is determined.
The pseudo-code is given as follows.

    Initialize $[a_0, b_0] = [1, n]$.
    for $i = 1$ to $|x|$ do
        Find the symbol $x[i]$ in the string $Col_i[a_{i-1}, b_{i-1}]$ using binary search.
        if the symbol appears in the string $Col_i[a_{i-1}, b_{i-1}]$ then
            Let $c_i$ be the position of the symbol $x[i]$ returned by the binary search.
            Find $a_i$ by double search and then binary search in the string $Col_i[a_{i-1}, c_i]$.
            Find $b_i$ by double search and then binary search in the string $Col_i[c_i, b_{i-1}]$.
        else
            Return the empty interval $\emptyset$.
        end if
    end for
    Return the interval $[a_{|x|}, b_{|x|}]$.

Analysis: As discussed above, it is easy to see the correctness of the algorithm. Let us
analyze the memory and time complexity of the algorithm. Since the algorithm only uses binary
search and double search, it needs $O(1)$ extra memory. For time complexity, it is easy to see
that computing the interval $[a_i, b_i]$ at step $i$ takes $O(\log(b_{i-1} - a_{i-1})) \le O(\log n)$ time because
both binary search and double search take $O(\log(b_{i-1} - a_{i-1}))$ time. Overall, the algorithm takes
$O(|x| \log n)$ time because there are at most $|x|$ steps.
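The column-by-column search can be sketched in Python as follows. This is an illustration under our own assumptions, not a reference implementation: indices are 0-based internally and converted to 1-based on return, the doubling step accumulates jumps of 1, 2, 4, ... from the last matching position (a common variant of the double search described above), and the names `prefix_interval_by_columns` and `_extend` are hypothetical. It assumes $|x| \le \ell$.

```python
def _extend(strings, col, ch, c, limit, d):
    """Farthest index from c, moving in direction d (+1 or -1) and not past
    limit, whose symbol in column col is still ch. Assumes strings[c][col] == ch."""
    good, step = c, 1
    # Doubling phase: jump by 1, 2, 4, ... while the symbol still matches.
    while (limit - (good + d * step)) * d >= 0 and strings[good + d * step][col] == ch:
        good += d * step
        step *= 2
    bad = good + d * step
    if (limit - bad) * d < 0:  # overshot the range: clamp to just outside it
        bad = limit + d
    # Binary phase: `good` always matches, `bad` never does (or lies outside).
    while abs(bad - good) > 1:
        mid = (good + bad) // 2
        if strings[mid][col] == ch:
            good = mid
        else:
            bad = mid
    return good


def prefix_interval_by_columns(strings, x):
    """PREFIX(x) as a 1-based inclusive pair, or None; strings must be sorted."""
    if not strings:
        return None
    lo, hi = 0, len(strings) - 1  # current interval [a_{i-1}, b_{i-1}], 0-based
    for col, ch in enumerate(x):
        left, right, c = lo, hi, -1
        while left <= right:  # binary search for any occurrence of ch in the column
            mid = (left + right) // 2
            sym = strings[mid][col]
            if sym == ch:
                c = mid
                break
            if sym < ch:
                left = mid + 1
            else:
                right = mid - 1
        if c < 0:
            return None
        lo, hi = _extend(strings, col, ch, c, lo, -1), _extend(strings, col, ch, c, hi, +1)
    return (lo + 1, hi + 1)


strings = ["AAACCGGGGTTT", "ACCAGAATTTGT", "ACCATGTGGTAT",
           "ACGGGCTTTCCA", "ACTAAGGAATTT", "TGGCCGAAGAAG"]
print(prefix_interval_by_columns(strings, "ACCA"))  # (2, 3)
```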

4.1.2 A trie-based algorithm 

As we have seen in Subsection 4.1.1, to compute the interval $[a_i, b_i]$ for symbol $x[i]$, we use binary
search to find the symbol $x[i]$ in the interval $[a_{i-1}, b_{i-1}]$. The binary search takes
$O(\log(b_{i-1} - a_{i-1})) \le O(\log n)$ time. We can reduce the $O(\log n)$ factor to $O(1)$ in computing the
interval $[a_i, b_i]$ by pre-computing all of the intervals for each symbol in the alphabet $\Sigma$ and storing
them in a trie. Given the symbol $x[i]$, to find the interval $[a_i, b_i]$ we just retrieve it from the trie,
which takes $O(1)$ time. The trie is defined as follows (see Figure 3). At each node in the trie, we store
a symbol and its interval. Observe that we do not have to store the nodes that have only one child.
These nodes form chains in the trie. We remove such chains and store their lengths in each remaining
node. As a result, each internal node in the trie has at least two children. Because each internal node
has at least two children, the number of nodes in the trie is no more than twice the number of leaves,
that is, no more than $2n$. Therefore, we need $O(n)$ memory to store the trie. Also, it is well known
that the trie can be constructed recursively in $O(\ell n)$ time.
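The idea of reading each interval off a trie can be sketched as below. Note the simplification: this sketch builds a plain, uncompressed trie (one node per distinct prefix, so up to $O(\ell n)$ nodes) and does not remove single-child chains as the text describes, so it does not meet the $O(n)$ memory bound. The dictionary layout and the names `build_trie` and `prefix_interval_trie` are our own assumptions.

```python
def build_trie(sorted_strings):
    """Uncompressed trie over sorted strings; each node stores the interval of
    identification numbers (1-based) of the strings that pass through it."""
    root = {"iv": (1, len(sorted_strings)), "kids": {}}
    for ident, s in enumerate(sorted_strings, start=1):
        node = root
        for ch in s:
            child = node["kids"].get(ch)
            if child is None:
                # First string with this prefix: the interval starts here.
                child = {"iv": (ident, ident), "kids": {}}
                node["kids"][ch] = child
            else:
                # Sorted input keeps the interval contiguous; extend its upper bound.
                child["iv"] = (child["iv"][0], ident)
            node = child
    return root


def prefix_interval_trie(root, x):
    """PREFIX(x) read off the trie in O(|x|) dictionary lookups, or None."""
    node = root
    for ch in x:
        node = node["kids"].get(ch)
        if node is None:
            return None
    return node["iv"]


strings = ["AAACCGGGGTTT", "ACCAGAATTTGT", "ACCATGTGGTAT",
           "ACGGGCTTTCCA", "ACTAAGGAATTT", "TGGCCGAAGAAG"]
trie = build_trie(strings)
print(prefix_interval_trie(trie, "ACCA"))  # (2, 3)
print(prefix_interval_trie(trie, "ACCC"))  # None
```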

It is easy to see that once the trie is constructed, the task of finding the interval $[a_i, b_i]$ for each
symbol $x[i]$ takes $O(1)$ time. Therefore, computing $PREFIX(x)$ takes $O(|x|)$ time.

Figure 3: An illustration of a trie for the example input strings in Subsection 4.1.1.



4.2 Computing the disjoint intervals 

In this subsection, we consider the problem of computing the maximal disjoint intervals, given
$k$ intervals $[a_1, b_1], [a_2, b_2], \ldots, [a_k, b_k]$ which either are pairwise disjoint or contain each other. As
discussed in Section 3, it is sufficient to build the forest of the $k$ input intervals. Once the forest is
built, outputting the maximal disjoint intervals can be done easily at each vertex of the forest.

The algorithm for the problem is described as follows. First we sort the input intervals in
non-decreasing order of their lower bounds $a_i$. Among those intervals whose lower bounds are equal,
we sort them in decreasing order of their upper bounds $b_i$. So after this step, we have 1)
$a_1 \le a_2 \le \cdots \le a_k$ and 2) if $a_i = a_j$ then $b_i \ge b_j$ for $1 \le i < j \le k$. Since the input intervals
either are pairwise disjoint or contain each other, there are only two possibilities for two intervals
$[a_i, b_i]$ and $[a_{i+1}, b_{i+1}]$ with $1 \le i < k$. Either $[a_i, b_i]$ contains $[a_{i+1}, b_{i+1}]$ or they are disjoint.
Observe that if $[a_i, b_i]$ contains $[a_{i+1}, b_{i+1}]$, then $[a_i, b_i]$ is actually the parent of
$[a_{i+1}, b_{i+1}]$. If they are disjoint, then the parent of $[a_{i+1}, b_{i+1}]$ is the smallest ancestor of
$[a_i, b_i]$ that contains $[a_{i+1}, b_{i+1}]$. If such an ancestor does not exist, then $[a_{i+1}, b_{i+1}]$ does not
have a parent. Let $A_i = \{[a_{i_1}, b_{i_1}], \ldots, [a_{i_m}, b_{i_m}]\}$ be the set of ancestors of $[a_i, b_i]$, where
$i_1 < \cdots < i_m$. It is easy to see that $[a_{i_1}, b_{i_1}] \supseteq \cdots \supseteq [a_{i_m}, b_{i_m}]$.
Therefore, the smallest ancestor of $[a_i, b_i]$ that contains $[a_{i+1}, b_{i+1}]$ can be found by binary search,
which takes at most $O(\log k)$ time. Furthermore, assume that $[a_{i_j}, b_{i_j}]$ is the smallest such ancestor.
Then the set of ancestors of $[a_{i+1}, b_{i+1}]$ is $A_{i+1} = \{[a_{i_1}, b_{i_1}], \ldots, [a_{i_j}, b_{i_j}]\}$. Based on these
observations, the algorithm can be described by the following pseudo-code.
    Sort the input intervals $[a_i, b_i]$ as described above.
    Initialize $A = \{[a_1, b_1]\}$. /* A holds the current interval $[a_i, b_i]$ together with its ancestors */
    for $i = 1$ to $k - 1$ do
        if $[a_i, b_i]$ contains $[a_{i+1}, b_{i+1}]$ then
            Output $[a_i, b_i]$ is the parent of $[a_{i+1}, b_{i+1}]$.
            Add $[a_{i+1}, b_{i+1}]$ into $A$.
        else
            Assume that $A = \{[a_{i_1}, b_{i_1}], \ldots, [a_{i_m}, b_{i_m}]\}$.
            Find the smallest interval in $A$ that contains $[a_{i+1}, b_{i+1}]$.
            if the smallest interval is found then
                Assume that the smallest interval is $[a_{i_j}, b_{i_j}]$.
                Output $[a_{i_j}, b_{i_j}]$ is the parent of $[a_{i+1}, b_{i+1}]$.
                Set $A = \{[a_{i_1}, b_{i_1}], \ldots, [a_{i_j}, b_{i_j}], [a_{i+1}, b_{i+1}]\}$.
            else
                Set $A = \{[a_{i+1}, b_{i+1}]\}$.
            end if
        end if
    end for

Analysis: As argued above, the algorithm is correct. Let us analyze the run time of the
algorithm. Sorting the input intervals takes $O(k)$ time by using integer sort, since the lower bounds
are integers. It is easy to see that finding the smallest interval in the set $A$ dominates the running
time at each step of the loop, which takes $O(\log k)$ time. There are at most $k$ steps, so the run
time of the algorithm is $O(k \log k)$ overall.
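A possible Python rendering of the forest construction is given below as a sketch. It assumes the input intervals are pairwise disjoint or nested, removes exact duplicates, and uses Python's comparison sort rather than integer sort, so the sorting step costs $O(k \log k)$ here. The list `path` plays the role of the set $A$ (the current interval together with its ancestors); the function name `build_forest` is hypothetical.

```python
def build_forest(intervals):
    """Return (sorted_intervals, parent), where parent[t] is the index of the
    parent of sorted_intervals[t], or None if that interval is a root."""
    # Lower bounds ascending, ties by upper bound descending: parents come first.
    ivs = sorted(set(intervals), key=lambda ab: (ab[0], -ab[1]))
    parent = [None] * len(ivs)
    path = []  # indices of the ancestors of the previous interval, then that interval
    for t, (a, b) in enumerate(ivs):
        # Intervals on the path are nested, outermost first, so those containing
        # [a, b] form a prefix of the path; binary search for where it ends.
        lo, hi = 0, len(path)
        while lo < hi:
            mid = (lo + hi) // 2
            pa, pb = ivs[path[mid]]
            if pa <= a and b <= pb:
                lo = mid + 1
            else:
                hi = mid
        if lo > 0:
            parent[t] = path[lo - 1]  # smallest interval on the path containing [a, b]
        del path[lo:]                 # the remaining intervals cannot be ancestors
        path.append(t)
    return ivs, parent


# The intervals of Figure 2.
ivs, parent = build_forest([
    (1, 93), (100, 130), (1, 20), (25, 50), (110, 120), (3, 5), (7, 8),
    (12, 15), (28, 36), (40, 47), (112, 114), (116, 120), (30, 33),
])
parent_of = {ivs[t]: (ivs[p] if p is not None else None) for t, p in enumerate(parent)}
print(parent_of[(30, 33)])    # (28, 36)
print(parent_of[(40, 47)])    # (25, 50)
print(parent_of[(100, 130)])  # None
```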

4.3 Algorithms for constructing the compact data structure 

In this subsection, we describe two complete algorithms for constructing the data structure. The
algorithms use the routines in Subsection 4.1 and Subsection 4.2. The only difference between
these two algorithms is the way of computing $PREFIX$. The first algorithm uses the routine based
on binary search to compute $PREFIX$, whereas the second one uses the trie-based routine.
The following pseudo-code describes the first algorithm.

    1: for $i = 1$ to $n$ do
    2:     for $j = 2$ to $\lambda + 1$ do
    3:         Compute $PREFIX(s_i[j, |s_i|])$ by the routine based on binary search in Subsection 4.1.1.
    4:     end for
    5:     Output the disjoint intervals from the input intervals $PREFIX(s_i[2, |s_i|]), \ldots, PREFIX(s_i[\lambda + 1, |s_i|])$ by using the routine in Subsection 4.2.
    6: end for

Let us analyze the time and memory complexity of the first algorithm. Each computation
of $PREFIX$ in line 3 takes $O(\ell \log n)$ time and $O(1)$ extra memory. So the loop of line 2 takes
$O(\lambda \ell \log n)$ time and $O(\lambda)$ extra memory. Computing the disjoint intervals in line 5 takes
$O(\lambda \log \lambda)$ time and $O(\lambda)$ extra memory. Since $\lambda \le \ell$, the run time of the loop of line 2
dominates the run time of each step of the loop of line 1. Therefore, the algorithm takes
$O(\lambda \ell n \log n)$ time and $O(\lambda)$ extra memory in total.
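To make the overall construction concrete, here is a compact end-to-end sketch in the spirit of the first algorithm. It is not the implementation analyzed above: $PREFIX$ is computed with Python's `bisect` over whole strings (with an assumed sentinel character that sorts after the alphabet), and the disjoint intervals are produced by a stack-based sweep instead of an explicit forest. Each piece carries the weight of the innermost interval covering it, which corresponds to the longest overlap. All names are hypothetical.

```python
from bisect import bisect_left

SENTINEL = "\uffff"  # assumption: sorts after every alphabet symbol


def prefix_interval(strings, x):
    lo = bisect_left(strings, x)
    hi = bisect_left(strings, x + SENTINEL)
    return (lo + 1, hi) if lo < hi else None  # 1-based, inclusive


def flatten(labelled):
    """Turn nested-or-disjoint (a, b, w) intervals into disjoint (a, b, w) pieces;
    each piece takes the weight of the innermost interval covering it."""
    items = sorted(labelled, key=lambda t: (t[0], -t[1], -t[2]))
    pieces, stack = [], []  # stack entries: [first uncovered position, upper bound, weight]

    def close():
        cursor, end, w = stack.pop()
        if cursor <= end:
            pieces.append((cursor, end, w))

    for a, b, w in items:
        while stack and stack[-1][1] < a:  # intervals that end before a are finished
            close()
        if stack:
            top = stack[-1]
            if top[0] < a:
                pieces.append((top[0], a - 1, top[2]))  # part of the parent before the child
            top[0] = b + 1  # the child covers [a, b] inside its parent
        stack.append([a, b, w])
    while stack:
        close()
    return sorted(pieces)


def build_overlap_structure(strings, lam):
    """Return (sorted strings, table); table[i - 1] lists the disjoint
    (lower, upper, weight) intervals describing OutNeigh(v_i)."""
    strings = sorted(strings)
    table = []
    for s in strings:
        found = []
        for w in range(1, min(lam, len(s)) + 1):
            iv = prefix_interval(strings, s[w:])  # suffix s_i[w + 1, |s_i|]
            if iv is not None:
                found.append((iv[0], iv[1], w))
        table.append(flatten(found))
    return strings, table


strings, table = build_overlap_structure(["CAAA", "ACGT", "AACG", "AAAC"], 3)
print(strings)   # ['AAAC', 'AACG', 'ACGT', 'CAAA']
print(table[3])  # [(1, 1, 1), (2, 2, 2), (3, 3, 3)]
```

For the last string, CAAA, the three suffix intervals $[1, 1] \subseteq [1, 2] \subseteq [1, 3]$ are nested, and the sweep splits them into three disjoint pieces with weights 1, 2 and 3.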

The second algorithm is described by the same pseudo-code, preceded by one step that constructs
the trie, and with the binary-search routine of Subsection 4.1.1 replaced by the trie-based routine
of Subsection 4.1.2 when computing $PREFIX(s_i[j, |s_i|])$. Let us analyze the second algorithm. Each
computation of $PREFIX$ takes $O(\ell)$ time instead of $O(\ell \log n)$ as in the first algorithm. With an
analysis similar to that of the first algorithm, the two nested loops take $O(\lambda \ell n)$ time and $O(\lambda)$
extra memory. Constructing the trie takes $O(\ell n)$ time. Therefore, the algorithm runs in $O(\lambda \ell n)$
time. We also need $O(n)$ extra memory to store the trie. In many cases, $n$ is much larger than $\lambda$,
so the algorithm takes $O(n)$ extra memory.

5 Conclusions



We have described a memory-efficient data structure that represents the exact-match overlap graph.
We have shown that this data structure needs at most $(2\lambda - 1)(2\lceil \log n \rceil + \lceil \log \lambda \rceil)n$ bits, which is a
surprising result because the number of edges in the graph can be $\Omega(n^2)$. Also, it takes $O(\log \lambda)$ time
to access an edge through the data structure. We have proposed two fast algorithms to construct
the data structure. The first algorithm is based on binary search, runs in $O(\lambda \ell n \log n)$ time and
takes $O(\lambda)$ extra memory. The second algorithm, based on the trie, runs in $O(\lambda \ell n)$ time, which
is faster than the first algorithm by a factor of $\log n$, but it takes $O(n)$ extra memory to store the
trie. The nice thing about the first algorithm is that the memory it uses is mostly the memory of the
input strings. This feature is crucial for building an efficient DNA assembler. Our data structure
can help in building a DNA assembler that handles very large-scale datasets. In the future, we would
like to exploit our data structure to speed up some operations on exact-match overlap graphs that are
commonly used in a DNA assembler, such as removing transitive edges, greedily walking on the graph,
and extracting all of the chains.